ABSTRACT

The Integrated Microbial Genomes (IMG) system 
serves as a community resource for comparative 
analysis of publicly available genomes in a compre- 
hensive integrated context. IMG integrates publicly 
available draft and complete genomes from all three 
domains of life with a large number of plasmids 
and viruses. IMG provides tools and viewers for 
analyzing and reviewing the annotations of genes 
and genomes in a comparative context. IMG's data 
content and analytical capabilities have been con- 
tinuously extended through regular updates since 
its first release in March 2005. IMG is available at 
http://img.jgi.doe.gov. Companion IMG systems 
provide support for expert review of genome anno- 
tations (IMG/ER: http://img.jgi.doe.gov/er), teaching 
courses and training in microbial genome analysis 
(IMG/EDU: http://img.jgi.doe.gov/edu) and analysis 
of genomes related to the Human Microbiome 
Project (IMG/HMP: http://www.hmpdacc-resources 
.org/img_hmp). 



INTRODUCTION 

The Integrated Microbial Genomes (IMG) system inte- 
grates publicly available draft and complete microbial 
genomes from all three domains of life with a large 
number of plasmids and viruses. IMG employs NCBI's
RefSeq resource (1) as its main source of public genome 
sequence data, and 'primary' annotations consisting of 
predicted genes and protein products. For every genome, 
IMG records its primary genome sequence information 



from RefSeq including its organization into chromosomal 
replicons (for finished genomes) and scaffolds and/or 
contigs (for draft genomes), together with predicted 
protein-coding sequences (CDSs), some RNA-coding 
genes and protein product names that are provided by 
the genome sequence centres. 

IMG's data integration pipeline associates every 
genome with metadata from GOLD (2), and fills in add- 
itional information potentially missing from the RefSeq 
files such as CRISPR repeats (3), signal peptides 
computed using SignalP (4) and transmembrane helices 
computed using TMHMM (5). Missing RNAs are 
identified using tRNAscan-SE-1.23 (6) for tRNAs, in-house
developed HMMs for rRNAs (7), and Rfam (8)
and INFERNAL v1.0 (9) for other small RNAs. Genes
are associated with 'secondary' functional annotations 
and lists of related (e.g. homologue, paralogue) genes. 
IMG generated annotations consist of protein family 
and domain characterizations based on COG clusters 
and functional categories (10), Pfam (11), TIGRfam and 
TIGR role categories (12), InterPro domains (13), Gene 
Ontology (GO) terms (14) and KEGG Ortholog (KO) 
terms and pathways (15). 

The association of KEGG pathways with IMG 
genomes is based on the assignment of KEGG 
Orthology (KO) terms to IMG genes via a mapping of 
IMG genes to KEGG genes. The MetaCyc collection of 
pathways (16) is also available in IMG, whereby the asso- 
ciation of MetaCyc pathways with IMG genomes is based 
on correlating enzyme EC numbers in MetaCyc reactions 
with EC numbers associated with IMG genes via KO 
terms. Genes are further characterized using an IMG 
native collection of generic (protein cluster-independent) 
functional roles called IMG terms that are defined by 
their association with generic (organism-independent) 
functional hierarchies, called IMG pathways (17). IMG
terms and pathways are specified by domain experts at 
DOE-JGI as part of the process of annotating specific 
genomes of interest, and are subsequently propagated to 
all the genomes in IMG using a rule based methodology 
(18). Transporter genes are linked to the Transport Classi- 
fication Database (19) based on their assignment to COG, 
Pfam or TIGRfam domains or IMG Terms that 
correspond to transporter families. 

For each gene, IMG provides lists of related (e.g. can- 
didate homologue, paralogue, orthologue) genes that are 
based on sequence similarities computed using NCBI 
BLASTp for protein coding genes and BLASTn for 
RNA genes. Such lists of genes can be filtered using 
percent identity, bit score and more stringent E-values.
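
The filtering of related-gene lists described above can be sketched as follows. This is only an illustration: the `Hit` record, its field names and the cutoff values are assumptions made for the example and do not reflect IMG's internal data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One related gene found by sequence similarity (hypothetical record)."""
    gene_id: str
    percent_identity: float
    bit_score: float
    e_value: float


def filter_hits(hits, min_identity=0.0, min_bit_score=0.0, max_e_value=10.0):
    """Keep hits that pass all three cutoffs; a lower E-value is more stringent."""
    return [
        h for h in hits
        if h.percent_identity >= min_identity
        and h.bit_score >= min_bit_score
        and h.e_value <= max_e_value
    ]


# Illustrative values only.
hits = [
    Hit("gene_A", 92.5, 410.0, 1e-120),
    Hit("gene_B", 41.0, 95.0, 1e-20),
    Hit("gene_C", 28.0, 38.0, 0.5),
]
strict = filter_hits(hits, min_identity=40.0, max_e_value=1e-5)
```

Here `strict` retains `gene_A` and `gene_B`, while `gene_C` is dropped by both the identity and the E-value cutoff.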

IMG's data integration pipeline identifies gene fusions 
and conserved gene cassettes (putative operons). A fused 
gene (fusion) is defined as a gene that is formed from the 
composition (fusion) of two or more previously separate 
genes (20). Transposases and integrases, pseudogenes, and 
genes from draft genomes are not considered as putative 
fusion components in order to avoid false positives caused 
by gene fragmentation. A 'chromosomal cassette' is 
defined as a stretch of genes with intergenic distance 
smaller than or equal to 300 bp (21), whereby the genes can
be on the same or different strands of the chromosome. 
Chromosomal cassettes with a minimum size of two genes 
common in at least two separate genomes are defined as 
'conserved chromosomal cassettes'. The identification of 
common genes across organisms is based on three gene 
clustering methods, namely participation in COG, Pfam 
and IMG orthologue clusters (22). Correlation scores 
between different gene clusters, based on their co-existence 
on fusion events, conserved chromosomal cassettes and 
genomes, provide insights into their function (21).
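
The definition of a chromosomal cassette given above can be illustrated with a short sketch. It assumes that genes are given as 1-based inclusive coordinates on a single scaffold, that the intergenic distance is the number of bases between consecutive genes, and that only stretches of at least two genes are reported; these conventions and the function name are assumptions for the example, not a description of IMG's pipeline.

```python
def chromosomal_cassettes(genes, max_gap=300):
    """Group genes on one scaffold into cassettes.

    genes: iterable of (gene_id, start, end), 1-based inclusive coordinates.
    Strand is ignored, since cassette genes may lie on either strand.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    cassettes, current, current_end = [], [], None
    for gene_id, start, end in ordered:
        # Bases strictly between the previous stretch and this gene;
        # overlapping genes give a negative gap and stay in the stretch.
        if current and start - current_end - 1 <= max_gap:
            current.append(gene_id)
            current_end = max(current_end, end)
        else:
            if len(current) >= 2:
                cassettes.append(current)
            current, current_end = [gene_id], end
    if len(current) >= 2:
        cassettes.append(current)
    return cassettes


# Illustrative coordinates only.
genes = [("g1", 1, 900), ("g2", 1000, 1800), ("g3", 2101, 3000),
         ("g4", 4000, 4500), ("g5", 4801, 5200)]
cassettes = chromosomal_cassettes(genes)
```

With these coordinates, `g1` to `g3` form one cassette and `g4` and `g5` form a second one, because the 999 bp gap between `g3` and `g4` exceeds the 300 bp limit.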

We review below IMG's data content growth and 
analysis tool extensions since the last published report 
on IMG (23). 

DATA CONTENT EXTENSIONS 

Genomics data 

The content of IMG has grown steadily since the first 
version released in March 2005, with IMG 3.4 (July 
2011) containing 3008 bacterial, archaeal and eukaryotic 
genomes, an increase of over 80% since August 2009 (23). 
IMG 3.4 also contains 2697 viral genomes and 1186 
plasmids that did not come from a specific microbial 
genome sequencing project, bringing its total genome
content to 6891 genomes with over 11.6 million genes
(a Content History link on IMG's home page provides
an overview of its content growth).

While archaeal, bacterial, plasmid and viral genomes 
are updated on a regular basis in IMG, the inclusion of 
eukaryotic genomes entails a more complex process (the
integration process into IMG for eukaryotic genomes is
described at http://img.jgi.doe.gov/w/doc/euks.html) and
is done at longer intervals. Since August 2009, about
70 new eukaryotic genomes have been added to IMG, 
out of which 40 are fungal genomes. 



The 'Expert Review' version of IMG, IMG/ER (24), 
allows individual scientists or groups of scientists to 
review and curate the functional annotation of microbial 
genomes in the context of IMG's public genomes. 
Scientists can submit their private genome data sets into 
IMG/ER (using password protected access) prior to their
public release either with their original annotations 
or with annotations generated by IMG's annotation 
pipeline (18). Since August 2009, close to 750 private 
genomes have been reviewed and curated using IMG/ER. 

Genomes generated as part of the Human Microbiome 
Project (HMP) (25) and the Genomic Encyclopedia of
Bacteria and Archaea (GEBA) project (26)
are of special interest. With the goal of characterizing 
microbial communities found at multiple human body 
sites, HMP has initially focused on the sequencing of 
reference genomes from both cultured and uncultured 
bacteria (25). Over 550 reference genomes sequenced as 
part of the HMP initiative, as well as over 1500 genomes 
associated with a human host and thus relevant to HMP, 
can be examined and analyzed using IMG/HMP 
(http://www.hmpdacc-resources.org/img_hmp/), which is 
provided as part of the HMP Data Analysis and 
Coordination Center (DACC). 

The aim of GEBA is to fill systematically the
sequencing gaps along the bacterial and archaeal 
branches of the tree of life. After a pilot project in 2009 
that generated complete genomes for about 100 organisms 
(26), the number of sequenced GEBA genomes has 
steadily increased and stands at 205 as of August 2011. 
GEBA genomes are available for analysis or download via 
a special purpose interface, IMG/GEBA (http://img.jgi 
.doe.gov/geba/), as soon as their annotation is completed 
at JGI, and before they are available in Genbank. 

Proteomics data 

Proteomics, transcriptomics, metabolomics, epigenomics 
and interactomics data are increasingly employed jointly 
with genomics data to refine our understanding of the 
functions of genes. Accordingly, these types of 'omics' 
data are gradually included into IMG. 

The first protein expression data sets included into 
IMG were generated as part of the Arthrobacter 
chlorophenolicus study conducted at the Oak Ridge
National Laboratory (27). Subsequently, data sets from 
Cryptobacterium curtum and Brachybacterium faecium 
studies conducted at WR Wiley Environmental 
Molecular Sciences Laboratory, Instrument Development 
Laboratory, Pacific Northwest National Laboratory were 
also added to IMG. 

For a genome involved in a protein expression study, 
the experiments/samples are recorded together with the 
experimental conditions and the protein expression data 
organized per expressed gene. For each expressed gene, 
the number of observed peptides is recorded together 
with peptide sequences and the normalized coverage. 
The normalized coverage is defined as the coverage of 
an expressed gene in an experiment divided by the total 
coverage of the genes in that experiment, where coverage 
for a gene is defined as the number of all observed
peptides for the gene divided by the size of the gene (28).
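
The coverage and normalized coverage defined above can be written out as a small sketch. The function and variable names are assumptions for the example, and the unit used for gene size is left to the caller because the text does not specify it.

```python
def gene_coverage(peptide_count, gene_size):
    """Coverage of a gene: observed peptides divided by the size of the gene."""
    return peptide_count / gene_size


def normalized_coverage(peptide_counts, gene_sizes):
    """Coverage of each expressed gene divided by the total coverage
    of all genes in the same experiment."""
    coverage = {g: gene_coverage(n, gene_sizes[g])
                for g, n in peptide_counts.items()}
    total = sum(coverage.values())
    if total == 0:
        raise ValueError("no peptides observed in this experiment")
    return {g: c / total for g, c in coverage.items()}


# Illustrative numbers only: two expressed genes in one experiment.
peptides = {"gene_a": 10, "gene_b": 5}
sizes = {"gene_a": 1000, "gene_b": 250}
normalized = normalized_coverage(peptides, sizes)
```

In this example the coverages are 0.01 and 0.02, so the normalized coverages are one third and two thirds; normalized values within an experiment always sum to one.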



Predicted phenotypes

Phenotypes are broadly defined as observable characteristics
of an organism. The phenotypes currently
in IMG are predicted using a set of rules based on
IMG's native collection of pathways.

Many physiological functions require the coordinated
action of several gene products, which can be grouped
into pathways, where genes function in a specific order.
Pathways can be analyzed in the context of other
pathways within the organism. For example, if an
organism degrades cellulose to cellobiose outside the
cell, it can only utilize cellulose as a carbon source if it
also has a transport pathway for uptake of cellobiose and,
within the cell, a metabolic pathway to gain energy from
cellobiose. If all three steps are present, then the organism
has the phenotype of Growth on cellulose via cellobiose.
In some cases the presence or absence of only one pathway
is required for a phenotype. In other cases there are
multiple possibilities, and a phenotype requires one of
several combinations of pathways.

Phenotype prediction rules consist of AND-OR combinations
of IMG pathway assertions. There are currently
56 rules to predict phenotypes grouped into categories
and subcategories, as shown in Figure 1, which displays
the first 11 rules together with the number of genomes
that are associated with a specific phenotype.
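
An AND-OR rule over pathway assertions, such as the cellulose example in this section, can be sketched as a small recursive evaluator. The rule encoding (nested tuples), the `NOT` operator used to express absence of a pathway, and the pathway names are all hypothetical; IMG's actual rule syntax is not shown in this paper.

```python
def evaluate(rule, asserted_pathways):
    """Evaluate a rule against the set of pathways asserted for a genome.

    A rule is either a pathway name (string) or a tuple
    (operator, operand, ...) with operator "AND", "OR" or "NOT".
    """
    if isinstance(rule, str):
        return rule in asserted_pathways
    operator, *operands = rule
    if operator == "AND":
        return all(evaluate(r, asserted_pathways) for r in operands)
    if operator == "OR":
        return any(evaluate(r, asserted_pathways) for r in operands)
    if operator == "NOT":
        return not evaluate(operands[0], asserted_pathways)
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical pathway names for 'Growth on cellulose via cellobiose'.
growth_on_cellulose = (
    "AND",
    "cellulose degradation to cellobiose",
    "cellobiose transport",
    "cellobiose catabolism",
)

genome_1 = {"cellulose degradation to cellobiose",
            "cellobiose transport", "cellobiose catabolism"}
genome_2 = {"cellulose degradation to cellobiose", "cellobiose catabolism"}

has_phenotype_1 = evaluate(growth_on_cellulose, genome_1)
has_phenotype_2 = evaluate(growth_on_cellulose, genome_2)
```

The first genome satisfies all three assertions and is assigned the phenotype; the second lacks the transport pathway and is not.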



ANALYSIS TOOL EXTENSIONS 

Genome data analysis in IMG consists of operations 
involving genomes, genes and functions which can be 
selected, explored individually, and compared. The com- 
position of analysis operations is facilitated by genome, 
scaffold, gene and function 'carts' that handle lists of 
genomes, scaffolds, genes and functions, respectively. 



Data selection tools 

Genomes, genes and functions can be selected using 
browsers and search tools. Browsers allow users to select 
genomes and functions organized as alphabetical lists 
or using domain specific hierarchical classifications. 
Keyword search tools allow identifying genomes, genes 
and functions of interest using a variety of selection 
filters. Genomes can be also selected using a search tool 
which allows specifying conditions involving metadata 
attributes, such as temperature range, oxygen requirement 
or ecosystem, while genes can be also selected using 
BLAST search tools against various data sets. 

IMG's data selection tools have been extended in order 
to improve their efficiency and usability. For example, 



Predicted Phenotypes

Rule ID | Name | Category | Category Value | Description | No. of Genomes w/ Phenotype
00001 | L-histidine prototroph | Metabolism | Prototrophic | Organism is predicted to be able to synthesize L-histidine. | 52
00002 | Aerobe | Oxygen Requirement | Aerobe | Organism is predicted to be able to grow in the presence of air. | 255
00003 | L-lysine prototroph | Metabolism | Prototrophic | Organism is predicted to be able to synthesize L-lysine. | 327
00004 | Denitrifier | Metabolism | Denitrifying | Organism is predicted to be able to reduce nitrate to nitrogen (N2) | [illegible]
00005 | Use of nitrate as electron acceptor | Metabolism | Nitrate reducer | Organism is predicted to be able to grow anaerobically with nitrate as electron acceptor | 516
00006 | Carbon fixation | Metabolism | Carbon fixation | Organism is predicted to be able to use carbon dioxide as sole carbon source | [illegible]
00007 | L-lysine auxotroph | Metabolism | Auxotroph | Organism is predicted to be unable to synthesize L-lysine. | 2322
00010 | L-alanine prototroph | Metabolism | Prototrophic | Organism is predicted to be able to synthesize L-alanine | 1775
00011 | L-alanine auxotroph | Metabolism | Auxotroph | Organism is predicted to be unable to synthesize L-alanine | 493
00012 | L-aspartate prototroph | Metabolism | Prototrophic | Organism is predicted to be able to synthesize L-aspartate | 1701
00013 | L-aspartate auxotroph | Metabolism | Auxotroph | Organism is predicted to be unable to synthesize L-aspartate | 612
00014 | L-glutamate prototroph | Metabolism | Prototrophic | Organism is predicted to be able to synthesize L-glutamate | 2344
00015 | L-glutamate auxotroph | Metabolism | Auxotroph | Organism is predicted to be unable to synthesize L-glutamate | 204
00016 | L-phenylalanine prototroph | Metabolism | Prototrophic | Organism is predicted to be able to synthesize L-phenylalanine | 231


Figure 1. A sample of rules for predicting phenotypes in IMG. 
Figure 2. Genome browser and search tools. The 'Genome Browser' displays the genomes organized in a phylogenetic tree or (i) in a tabular list that
can be configured by (ii) adding or removing genome, metadata or annotation specific columns. (iii) 'Genome Search' allows searching genomes on
genome or metadata specific fields. (iv) A genome can be explored using a variety of browsing tools, searched for the presence of specific genes using
BLAST, or downloaded.



genomes can be selected using 'Genome Browser' or 
'Genome Search', as illustrated in Figure 2. 

The 'Genome Browser' displays the genomes organized 
in a phylogenetic tree or in a tabular format as illustrated 
in Figure 2(i). The tabular display of genomes has a 
dynamic layout, with columns that can be resized,
reordered and sorted on content, configurable page
display size, and an export capability for saving tables as
Excel spreadsheets or tab delimited files. A 'Column
Selector' allows columns to be hidden. The genome table can
also be reconfigured by adding or removing genome,
metadata or annotation specific columns, as illustrated
in Figure 2(ii). Note that the number of metadata attributes
associated with genomes has increased substantially
in the past few years, whereby the data for these attributes 
is collected from GOLD (2). 'Genome Search' allows 
searching genomes on genome or metadata specific 
fields, as illustrated in Figure 2(iii). 

Individual genomes can be explored using the 
'Organism Details' page which provides a variety of 
tools for browsing, searching for the presence of specific 
genes, or downloading genome data sets, as illustrated in 
Figure 2(iv). This page also provides information 



(metadata) on the genome together with various genome 
statistics of interest, such as the number of genes that are 
associated with KEGG, COG, Pfam, InterPro or enzyme 
information. Individual genes can be analyzed using the 
'Gene Details' page which includes Gene Information, 
Protein Information and Pathway Information tables, 
evidence for functional prediction, COG, Pfam and 
pre-computed homologues. 

Tabular and graphical displays, such as graphical 
viewers for the distribution of genes associated with 
COG, Pfam, TIGRfam and KEGG for each genome, 
have been extended in order to facilitate genome and 
gene exploration. Individual functional categories, such 
as COG, Pfam, TIGRfam, KEGG Orthology terms and 
pathways, can be explored using functional category 
specific browsers. 

New IMG tools provide support for examining protein 
expression data as illustrated in Figure 3. Protein expres- 
sion studies are listed on the 'Experiments Statistics' 
section of the 'IMG Statistics' page and are available 
on the 'Organism Details' page of the genome they are 
associated with. A protein expression study, such as 
'Impact of Phenolic Substrate and Growth Temperature 



Nucleic Acids Research, 2012, Vol. 40, Database issue D119 



[Figure 3 screenshot panels: (i) Protein Expression Studies (Study ID, Study Name); (ii) Protein Expression Experiments (Sample Description, Gene Count, Peptide Count, Total Coverage, Average Coverage, Percent Observed Genes); (iii) Protein Expression Data for Selected Samples (Gene ID, Product Name, COG function, KEGG pathway); (iv) KEGG Map: Pyrimidine metabolism, coloured by abundance of genes for sample 1 (red for high, green for low expression); Find Up/Down Regulated Genes; Cluster Samples.]


Figure 3. Protein expression exploration tools. (i) 'Protein Expression Studies' are listed on the IMG Statistics page, with each study associated with
(ii) a list of 'Protein Expression Experiments' (samples). (iii) Samples can be selected for further analysis, such as examining expressed genes of
(iv) a single sample in the context of a pathway, where enzymes are displayed with colours representing the level of expression for the associated genes.
(v) Sample pairs can be compared in terms of gene up- or down-regulation, with the result of the comparison displayed as a histogram.



on the Arthrobacter chlorophenolicus' study shown in 
Figure 3(i), is associated with a list of samples (experi- 
ments). Summaries for samples include a description, 
the number of associated genes, the peptide count and 
the total and average coverage for the sample (the total
coverage is the sum of coverages for the genes in a sample,
where the coverage for a gene consists of the count of
its associated peptides divided by the size of the gene),
as illustrated in Figure 3(ii). Samples can be selected
for further analysis. Expressed genes of a single sample 
can be examined in the context of pathways, as illustrated 
in Figure 3(iv), whereby enzymes are displayed with 
colours representing the level of expression for the 
associated genes. Expressed genes of multiple samples 
can be also examined in the context of pathways, 
whereby enzymes are displayed with colours representing 
the percentage of samples with expressed genes associated 
with the enzymes. Samples (experiments) can be clustered 
based on coverage values for the genes expressed in 
each sample, with a choice of clustering methods, such 
as pairwise complete linkage and centroid linkage, 
and distance measures, such as Pearson correlation,



Spearman's rank correlation and Euclidean distance. 
The result of clustering is displayed as a hierarchical tree 
of samples and a normalized heat map of coverage values 
for each gene for each sample. 
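
As an illustration of the distance measures mentioned above, the following sketch computes pairwise distances between samples from their per-gene coverage vectors. It assumes that all samples report coverage for the same ordered list of genes and uses 1 minus the Pearson correlation as the correlation-based distance; both are assumptions for the example, and the linkage step that builds the hierarchical tree is not shown.

```python
import math
from itertools import combinations


def pearson_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length coverage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - cov / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Straight-line distance between two coverage vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise_distances(samples, distance):
    """samples: dict of sample name -> coverage vector over the same genes."""
    return {(a, b): distance(samples[a], samples[b])
            for a, b in combinations(sorted(samples), 2)}


# Illustrative coverage values for four genes in three samples.
samples = {
    "sample_1": [0.10, 0.20, 0.30, 0.40],
    "sample_2": [0.20, 0.40, 0.60, 0.80],
    "sample_3": [0.40, 0.30, 0.20, 0.10],
}
distances = pairwise_distances(samples, pearson_distance)
closest_pair = min(distances, key=distances.get)
```

An agglomerative clustering method would merge `closest_pair` first and then repeat on the reduced distance table according to the chosen linkage rule.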

Sample pairs can be compared in terms of gene up- or
down-regulation, with a threshold specified for the difference
in gene expression. The difference in expression is
computed using either the logR = log2(query/reference)
or the RelDiff = 2(query - reference)/(query + reference)
metric. The result of the comparison can be displayed as
a histogram, as illustrated in Figure 3(v), or in a tabular 
format. This histogram can be used to identify and set 
thresholds for the search of over expressed or under 
expressed genes between any pair of selected conditions. 
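
The two metrics used for comparing a query sample against a reference sample can be sketched as follows. The function names, the symmetric use of one threshold for both up- and down-regulation, the threshold value of 1 and the restriction to genes observed in both samples are assumptions made for the example.

```python
import math


def log_ratio(query, reference):
    """logR = log2(query / reference); requires positive values."""
    return math.log2(query / reference)


def rel_diff(query, reference):
    """RelDiff = 2 * (query - reference) / (query + reference)."""
    return 2 * (query - reference) / (query + reference)


def regulated_genes(query, reference, metric=log_ratio, threshold=1.0):
    """Split genes observed in both samples into up- and down-regulated lists."""
    up, down = [], []
    for gene in sorted(query.keys() & reference.keys()):
        difference = metric(query[gene], reference[gene])
        if difference >= threshold:
            up.append(gene)
        elif difference <= -threshold:
            down.append(gene)
    return up, down


# Illustrative expression values (for example normalized coverage) per gene.
query_sample = {"g1": 0.04, "g2": 0.01, "g3": 0.02}
reference_sample = {"g1": 0.01, "g2": 0.04, "g3": 0.02}
up, down = regulated_genes(query_sample, reference_sample)
```

Here `g1` is four times higher in the query (logR of about 2) and is reported as up-regulated, `g2` is four times lower and is reported as down-regulated, and `g3` is unchanged. RelDiff is bounded between -2 and 2 for non-negative values, so a suitable threshold for it differs from one chosen for logR.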

The genomes, genes and functions that result from 
search operations are displayed as lists from which 
genomes, genes and functions can be selected for inclusion 
into the 'Genome Cart', 'Gene Cart' and 'Function Cart', 
respectively. These carts have been extended in order to 
facilitate the composition of analysis tools in IMG. 
Thus, genes selected in 'Gene Cart' can be added 
directly to 'Function Cart' via their associated functions, 
such as COG, Pfam, TIGRfam. In a similar manner, functions
selected in 'Function Cart' can be added directly
to 'Gene Cart' via the genes associated with the selected 
functions, where the genes included into the 'Gene Cart' 
can be restricted to specific genomes. 

Comparative analysis tools 

Genomes can be compared in terms of gene content using 
the 'Phylogenetic Profiler' and 'Phylogenetic Profiler for 
Gene Cassettes' tools. The 'Phylogenetic Profiler' allows 
users to identify genes in a query genome in terms of 
presence or absence of homologues in other genomes. 
The 'Phylogenetic Profiler for Gene Cassettes' allows 
users to find genes that are part of a gene cassette in 
a query genome as well as part of related (conserved part 
of) gene cassettes in other genomes, whereby the result of 
such a search includes groups of collocated genes in each 
chromosomal cassette in the query genome that satisfy the 
search condition. More details on context analysis based 
on IMG's gene cassettes can be found in (22). 
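
The presence or absence query performed by a phylogenetic profiler can be illustrated with set operations. The sketch below assumes that, for each gene of the query genome, the set of other genomes containing a homologue has already been computed; the data structure, function name and genome labels are hypothetical.

```python
def phylogenetic_profile(homologue_genomes, present_in=(), absent_from=()):
    """Return query-genome genes with homologues in every genome of
    present_in and in no genome of absent_from.

    homologue_genomes: dict of gene id -> set of genomes with a homologue.
    """
    present_in, absent_from = set(present_in), set(absent_from)
    return sorted(
        gene for gene, genomes in homologue_genomes.items()
        if present_in <= genomes and not (absent_from & genomes)
    )


# Illustrative data for three genes of a query genome.
homologue_genomes = {
    "gene_1": {"genome_A", "genome_B"},
    "gene_2": {"genome_A", "genome_B", "genome_C"},
    "gene_3": {"genome_C"},
}
selected = phylogenetic_profile(
    homologue_genomes,
    present_in=["genome_A", "genome_B"],
    absent_from=["genome_C"],
)
```

Only `gene_1` has homologues in both required genomes and none in the excluded genome.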

Genomes can be compared in terms of functional 
capabilities using the 'Abundance Profile Overview' and 



'Function Profile' tools. The 'Abundance Profile 
Overview' allows users to compare the relative abundance 
of protein families (COGs, Pfams, TIGRfams) and func- 
tional families (enzymes) across selected genomes, 
whereby the results are displayed either as a heat map or 
a matrix, with the cells in the heat map and matrix linked 
to the list of genes assigned to a particular family in 
a genome. The 'Function Profile' is a selective version of 
the 'Abundance Profile Overview', with functions of 
interest first selected with the 'Function Cart'. 

The metabolic capabilities of genomes can be compared 
using the 'Abundance Profile Overview' and 'Function 
Profile' tools applied on enzymes involved in a pathway 
of interest. Alternatively, the metabolic capabilities of 
genomes can be compared in the context of KEGG 
pathways, as illustrated in Figure 4. Once a pathway 
is selected from the list of KEGG pathways via the 
KEGG option of the 'Find Functions' menu, as shown 
in Figure 4(i), the 'KEGG Pathway Details' lists the 
associated enzymes of KO terms, as illustrated in 
Figure 4(ii). Genomes for comparison are selected from
a phylogenetically organized list, with the comparison 







Figure 4. Comparative analysis tools. (i) A pathway is selected from the list of KEGG pathways via the KEGG option of the 'Find Functions'
menu, and subsequently (ii) the 'KEGG Pathway Details' lists its associated enzymes and the list of genomes organized phylogenetically. (iii) Once
genomes are selected for comparison, the result is displayed in the context of the KEGG pathway map, with each enzyme number on the map
coloured depending on the percentage of genomes with a gene associated with that enzyme. (iv) The 'Radial Phylogenetic Tree' is one of several tools
provided for comparing genomes, and (v) allows comparing the BLAST hits of the genes of up to five user selected genomes to the genes of all the
genomes in the database using a colour-coded hierarchical circular tree viewer.
result displayed on the KEGG pathway map, as illustrated
in Figure 4(iii). Each enzyme number on the map is 
coloured depending on the percentage of genomes with 
a gene associated with that enzyme, whereby the tooltip 
for a coloured enzyme displays the number of these 
genomes. 
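
The quantity that drives the colouring of enzymes on a pathway map, namely the number and percentage of selected genomes with a gene for each enzyme, can be computed as in the sketch below. The input layout (a set of EC numbers per genome), the function name and the genome labels are assumptions for the example; the mapping from percentage to colour is not specified in the text and is not modelled here.

```python
def enzyme_genome_percentages(pathway_enzymes, genome_enzymes):
    """For each enzyme in a pathway, return (number of genomes, percentage
    of genomes) that have at least one gene associated with that enzyme.

    genome_enzymes: dict of genome name -> set of EC numbers found in it.
    """
    n_genomes = len(genome_enzymes)
    if n_genomes == 0:
        raise ValueError("no genomes selected")
    result = {}
    for ec in pathway_enzymes:
        count = sum(1 for enzymes in genome_enzymes.values() if ec in enzymes)
        result[ec] = (count, 100.0 * count / n_genomes)
    return result


# Illustrative selection of four genomes and two enzymes of one pathway.
pathway_enzymes = ["EC:2.1.3.2", "EC:3.6.1.23"]
genome_enzymes = {
    "genome_A": {"EC:2.1.3.2", "EC:3.6.1.23"},
    "genome_B": {"EC:2.1.3.2"},
    "genome_C": {"EC:2.1.3.2"},
    "genome_D": set(),
}
percentages = enzyme_genome_percentages(pathway_enzymes, genome_enzymes)
```

The count in each pair corresponds to the number shown in the tooltip, and the percentage is the value on which the colour would be based.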

Genomes can be compared using two open source 
graphical viewers, 'Phylogenetic Distance Tree' and 
'Radial Phylogenetic Tree', available under the 
'Compare Genomes' main menu, as illustrated in 
Figure 4(iv). For both tools, genomes are selected for com- 
parison from a list of genomes similar to that shown in 
Figure 4(ii). The 'Phylogenetic Distance Tree' computes
the phylogenetic distance between genomes selected for
comparison based on the 16S alignment derived from the
SILVA database (29). For genes whose sequence is not
included in the alignment, the closest match is used if
its identity to the 16S gene of the IMG taxon is >97%.
The distance tree is displayed using the Archaeopteryx
tool (http://www.phylosoft.org/archaeopteryx/), which
uses phyloXML for data exchange (30). Each node in
the tree is hyperlinked to the IMG genome page for
that node.

The 'Radial Phylogenetic Tree' tool, originally
developed for MG-RAST (31), allows comparing the 
BLAST hits of the genes of up to 5 user selected 
genomes to the genes of all the genomes in the database 
using a colour-coded hierarchical circular tree viewer. This 
viewer displays the BLAST hits at different taxonomic 
levels, with more statistics for the hits for each genome 
provided by hovering the mouse over the nodes of the tree. 

Genomes can be compared in terms of sequence conser- 
vation using VISTA tools (32), the Artemis comparison 
tool (33) and a 'Dotplot' tool which employs the program 
'Mummer' to generate dotplot diagrams between two 
genomes. 

In addition to the analysis tools available in IMG, 
IMG/ER provides tools for identifying and correcting an- 
notation anomalies, such as dubious protein product 
names, and for filling annotation gaps detected using 
IMG's comparative analysis tools, such as genes that 
may have been missed by gene prediction tools or genes 
without predicted functions (24). Gene annotations that 
result from expert review and curation are captured in 
IMG/ER as so-called 'MyIMG' annotations associated
with individual scientist or group accounts, with curated 
genomes included into Genbank either as new submissions 
or as revisions of previously submitted data sets. 



FUTURE PLANS 

IMG's genome sequence data content is maintained 
through regular updates from public sequence data re- 
sources. Since proteomics, transcriptomics, metabolomics 
and other 'omics' data are increasingly employed to refine 
our understanding of the functions of genes, additional 
types of 'omics' data will be gradually included into 
IMG following a similar integration approach and 
analysis tools to those developed for protein expression 
data. 



IMG's integrated data framework allows assessing and 
improving the quality of genome annotations. Thus, the 
quality of gene models for genomes available in public 
resources is known to vary greatly depending on the 
quality of sequence and the software used for annotation. 
For example, an analysis conducted at JGI of the protein 
coding genes of microbial genomes in Genbank indicates that
~10% (over 1 million) of predicted protein-coding genes are
erroneous: they are false positive genes, unidentified 
pseudogene fragments or genes with translational excep- 
tions, or have incorrectly predicted start sites. In order to 
improve the consistency of annotation and the quality of 
predicted genes, a project for the re-annotation of all 
public microbial genomes in IMG has been launched 
recently. This project relies on a gene quality assessment 
pipeline, GenePRIMP (34), which performs automated
correction of gene models including insertion of
missed genes, extension of 'short' genes and identification 
of putative pseudogenes. 

The significant drop in the cost of sequencing has 
resulted in an exponential growth of new genome 
sequence data sets posing computational, data manage- 
ment and analytical challenges for the biological interpret- 
ation of these data sets. Furthermore, scientists are 
facing a data overload involving an increasing burden of 
analyzing a rapidly growing volume of genomic data.
These computational, data management and analytical 
challenges can be alleviated by synthesizing genomic 
data using the 'pangenome' conceptual abstractions (35). 
A pangenome consists of the core part of a species (i.e. the 
genes present in all of the sequenced strains or of all 
samples of a microbial community) and the variable part 
(the genes present in some but not all of the strains or 
samples). An experimental version of IMG has been 
extended with five pangenomes, as well as analysis tools 
and viewers that allow users to explore individual 
pangenomes and compare pangenomes and genomes. A 
public version of IMG containing pangenome data and 
analysis tools is expected to be released in the near future. 
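
The core and variable parts of a pangenome, as defined above, reduce to set operations once genes from different strains have been assigned to shared gene families. The sketch below assumes such an assignment is already available; the family identifiers, strain names and function name are hypothetical.

```python
def pangenome(strain_gene_families):
    """Split gene families into the core part (present in all strains)
    and the variable part (present in some but not all strains).

    strain_gene_families: dict of strain name -> set of gene family ids.
    """
    family_sets = list(strain_gene_families.values())
    if not family_sets:
        return set(), set()
    core = set.intersection(*family_sets)
    variable = set.union(*family_sets) - core
    return core, variable


# Illustrative gene family content of three strains of one species.
strains = {
    "strain_1": {"fam_1", "fam_2", "fam_3"},
    "strain_2": {"fam_1", "fam_2", "fam_4"},
    "strain_3": {"fam_1", "fam_2"},
}
core, variable = pangenome(strains)
```

In this example `fam_1` and `fam_2` form the core, while `fam_3` and `fam_4` belong to the variable part.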